ABSTRACT

MODOMICS is a database of RNA modifications that 
provides comprehensive information concerning the 
chemical structures of modified ribonucleosides, 
their biosynthetic pathways, RNA-modifying 
enzymes and location of modified residues in RNA 
sequences. In the current database version, accessible
at http://modomics.genesilico.pl, we have included
new features: a census of human and yeast
snoRNAs involved in RNA-guided RNA modification,
a new section covering the 5'-end capping process,
and a catalogue of 'building blocks' for chemical 
synthesis of a large variety of modified nucleosides. 
The MODOMICS collections of RNA modifications, 
RNA-modifying enzymes and modified RNAs have 
also been updated. A number of newly identified
modified ribonucleosides and more than one hundred
functionally and structurally characterized
proteins from various organisms have been added. 
In the RNA sequences section, snRNAs and
snoRNAs with experimentally mapped modified
nucleosides have been added and the current collection
of rRNA and tRNA sequences has been substantially
enlarged. To facilitate literature searches,
each record in MODOMICS has been cross-referenced
to other databases and to selected key
publications. New options for database searching
and querying have been implemented, including a
BLAST search of protein sequences and a
PARALIGN search of the collected nucleic acid
sequences.

INTRODUCTION 

During the course of RNA maturation, various enzymes 
are able to introduce chemical modifications into 
ribonucleotide residues. Chemical alteration may occur 
in the base, at the 2'-hydroxyl of the ribose, or both. 
Many modified residues correspond to
intermediates in the sequential, multistep formation of
hypermodified nucleotides (1). There are also modifications
whose biosynthesis starts outside the RNA. For
example, queuosine derivatives arise from 7-deazaguanine, in
which extra chemical groups are attached to C7 (rather
than N7 as in guanine). These compounds are introduced
into RNA by transglycosylation involving the replacement 
of an original unmodified guanine by a modified pre-Q 
base (2). Likewise, in some positive-sense RNA viruses, 
such as those of the family Alphaviridae, the mRNA cap 
is formed by nucleotidyltransfer involving a GTP nucleotide
pre-modified to N7-methyl-GTP (m7GTP) (3). The
variety of chemical groups and their positions in natively 
modified nucleosides in RNA is illustrated in Figure 1. 

The location, abundance and distribution of various 
types of modification vary greatly between different 
RNA molecules, organisms and organelles. Physiological 
environment and growth conditions of the cell also affect 
the pattern of RNA modification and/or the degree of 
individual modifications among different RNA molecules 
of the same type (4). While the majority of modified nucleosides
are present in transfer and ribosomal RNAs of
all types of cells, they have also been found to occur in
small non-coding RNAs, such as spliceosomal (sn)RNAs,
small nucleolar (sno)RNAs and more recently in regulatory
RNAs, such as siRNAs, miRNAs and piRNAs (5,6).

Figure 1. The diversity of chemical groups introduced enzymatically during the process of RNA maturation at various locations of the four common
ribonucleosides A, G, C and U linked via 5'- to 3'-phosphodiester bonds. Only conventional acronyms/symbols are shown. The complete scientific
names of all modified nucleosides presented in this figure are provided in the MODOMICS database. Nucleosides in RNA can also be doubly or
triply modified by the introduction of several of these chemical groups.

The presence of modified bases in mRNAs and viral
RNAs and their potential role in the regulation of gene
expression have recently been studied intensively (7-9).
New types of modifications have been found (10-16),
and biochemical and physiological roles have been
revealed for many known modified ribonucleosides.
Examples include the immune response to rRNA and 
tRNA methylation (17,18), the linkage between tRNA 
modification and host resistance to viral infection (19) 
and stress-induced cleavage of small RNAs (20). Many 
of these advances were driven by the use of synthetic 
RNA containing naturally occurring modified nucleotides. 
Moreover, numerous new RNA-modifying enzymes have 
been identified and characterized (21-26). To adequately 
represent this rapid accumulation of knowledge, we have 
added both to the variety and volume of data in the 
MODOMICS database. The most significant additions 
are: (i) snoRNAs linked to the corresponding modification 
sites in human and yeast RNAs; (ii) modifications in 
snRNAs and snoRNAs; (iii) an update of the recently
identified enzymes and pathways; (iv) a catalogue of
'building blocks' for the chemical synthesis of naturally 
occurring modified nucleosides. The implementation of 
BLAST (27) and PARALIGN (28) sequence search 
engines facilitates access to MODOMICS data on the 
level of protein and nucleic acid sequences. Finally, 
greater functionality has been added to the user interface. 

DATABASE CONTENT 

The MODOMICS database (http://modomics.genesilico.pl)
has been developed to house and distribute collections
of RNA modification pathways, chemical structures of
modified nucleosides, sequences of modified RNAs,
enzymes responsible for individual reactions and a catalogue
of 'building blocks' for chemical synthesis of
modified RNA. MODOMICS was created as a single
resource to organize and present all these data in a convenient
and straightforward way. Information about
modified residues is also available in the RNMDB
database (29). General-purpose pathway databases, such
as REACTOME (30), also present some aspects of RNA
modification pathways. However, MODOMICS is currently
the most comprehensive source of information
among all existing RNA modification databases.

MODIFICATIONS 

At present, MODOMICS contains 144 different modifications
that have been identified in RNA molecules [34 were
added since the previous database release (31)]. A typical
entry for a modified ribonucleoside contains information
about its basic chemical properties, localization in known
RNA molecule types, the phylogenetic distribution with
respect to Domains of Life and known enzymes responsible
for its biosynthesis. The list of modified nucleosides
can be browsed by the modification names, the standard 
bases (A, G, C and U) from which they originate and the 
chemical groups they contain. The available details
include full and short names, the sum formula,
PubChem ID and, to facilitate MS analyses of modified
RNAs, the monoisotopic, HRMS and average masses.
The chemical structures of the modified nucleosides are
represented by 1D SMILES codes, 2D structure plots and
3D structures in the mol format displayed interactively on
the website by a Jmol applet. Reactions linking a modified 
nucleoside to its precursor(s) and to hypermodifications 
are listed. Many of the products of modification reactions 
are substrates of further reactions, and the formation of 
hypermodified residues occurs in complex pathways, 
which are displayed as graphs. All modified nucleosides 
found in RNA structures deposited in the RCSB Protein 
Data Bank are also indicated and appropriate hyperlinks 
are provided. 
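
The relationship between a sum formula and the monoisotopic and average masses listed in a modification entry can be illustrated with a minimal sketch. The atomic mass tables below are rounded standard values, the function names are hypothetical, and the code is not taken from MODOMICS; it only assumes a simple formula without brackets or charges.

```python
import re

# Rounded atomic masses in unified atomic mass units (illustrative values).
# Monoisotopic: mass of the most abundant isotope of each element.
MONOISOTOPIC = {"C": 12.000000, "H": 1.007825, "N": 14.003074,
                "O": 15.994915, "S": 31.972071}
# Average: natural-abundance weighted mean of all isotopes.
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}


def parse_formula(formula):
    """Turn a sum formula such as 'C9H12N2O6' into {'C': 9, 'H': 12, ...}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def mass(formula, table):
    """Sum the atomic masses from `table` over all atoms in `formula`."""
    return sum(table[el] * n for el, n in parse_formula(formula).items())


uridine = "C9H12N2O6"  # sum formula of uridine
print(round(mass(uridine, MONOISOTOPIC), 4))
print(round(mass(uridine, AVERAGE), 2))
```

Pseudouridine is an isomer of uridine, so it has the same sum formula and therefore the same masses; this is why mass alone cannot distinguish every modification from its unmodified parent.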

PATHWAYS 

MODOMICS comprises a collection of RNA modification
pathways divided into six different categories according
to their starting point: four categories correspond to
the standard bases (A, G, C and U), a fifth presents the
incorporation and hypermodification pathway of
queuosine and the sixth covers the modifications of the RNA
5'-cap. The pathway display in MODOMICS has
undergone a radical overhaul. The new display, which is 
based on the Cytoscape Web network visualization tool 
(32), allows users to zoom and change the graph layout, 
and to save the obtained result as an image (png or svg), 
pdf or xml file. Pathway graphs are now easier to navigate 
and present information about the structures of modified 
nucleosides and the type of chemical reactions involved. 
Enzymatic transformations that have
been verified experimentally (plain arrows) are distinguished
from 'putative' reactions that are predicted but
not yet experimentally confirmed (dashed arrows). Users
can access detailed information about each reaction
(including the type of transformation and enzymes responsible
for its execution) by simply clicking the chosen arrow
of the graph. The display of pathway graphs is currently
supported for Google Chrome, Mozilla Firefox and for 
Microsoft Internet Explorer 9 (with Document Mode set 
to Internet Explorer 9 standards). 
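
The distinction between verified and putative reactions can be pictured as a directed graph whose edges carry a flag. The following is a minimal sketch under stated assumptions: the reaction list is a toy example using MODOMICS-style short names ('Y' for pseudouridine), the product 'X' and all flags are made up for illustration, and the function name is hypothetical.

```python
from collections import deque

# Toy pathway: (substrate, product, verified). The edges and flags are
# illustrative only and are not database content.
REACTIONS = [
    ("U", "Y", True),
    ("Y", "m1Y", True),
    ("m1Y", "m1acp3Y", True),
    ("U", "m5U", True),
    ("m5U", "X", False),  # 'X' is a made-up putative product
]


def reachable(start, reactions, include_putative=True):
    """Breadth-first search over reactions; returns products in discovery order."""
    graph = {}
    for substrate, product, verified in reactions:
        if verified or include_putative:
            graph.setdefault(substrate, []).append(product)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


print(reachable("U", REACTIONS))                          # all arrows
print(reachable("U", REACTIONS, include_putative=False))  # plain arrows only
```

Dropping the putative edges before the traversal corresponds to reading only the plain arrows of a pathway graph.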

RNA SEQUENCES 

MODOMICS provides a collection of modified RNA 
sequences of different types, such as tRNAs, rRNAs, 
snRNAs and snoRNAs. For families of homologous 
RNAs, multiple sequence alignments are available. 
Sequences are visualized with all modifications highlighted 
and linked to the corresponding modification record. The 
alignments can be displayed directly on the webpage or in 
a Jalview applet, or downloaded in plain text format. In 
comparison to the previous release of MODOMICS, the 
set of tRNA sequences has been updated and greatly 
expanded. In particular, in addition to the previously
collected manually curated sequences, MODOMICS
now contains all tRNAs imported from the Transfer
RNA database (33) that were found to possess modified
nucleosides (over 500 tRNA sequences in total). The secondary
structure of tRNAs is now also indicated. rRNA
sequence alignments and secondary structure indications 
have been updated as well, based on the data from the 
Comparative RNA Website (CRW) (34). MODOMICS 
includes an arbitrarily selected subset of rRNA sequences
representing different phylogenetic taxa, based on those
from the CRW database. For these sequences, both the
positions of modified residues and the identities of
rRNA-modifying enzymes are known. The current release (as of
September 2012) contains 10 SSU rRNA sequences (5 
from Bacteria, 2 from Archaea and 3 from Eukaryota) 
and 9 LSU rRNA sequences (4 from Bacteria, 2 from 
Archaea and 3 from Eukaryota). For the cytoplasmic 
LSU in Eukaryota, 5.8S and 28S rRNAs are presented 
as a single-fused molecule, as in the CRW database. 
Currently, only one representative rRNA SSU and one
LSU sequence per species is included. In the future, we
intend to expand this data set to include rRNAs from
additional species, as well as to cover all rRNA variants
encoded by a given genome. To map modifications onto
rRNA sequences, we used data from 'The Small
Subunit rRNA Modification Database' (35), the 3D
rRNA modification maps database (36) and the recently
published data concerning modifications in both ribosomal
subunits. Finally, we have improved MODOMICS by
adding modified snRNA and snoRNA sequences.
Unmodified snRNA and snoRNA sequences and alignments
were obtained from the Rfam database (37).
Positions of modifications in snRNA and snoRNA sequences
were included based on the published data.

A new utility that allows mapping the modified positions
on secondary structure diagrams of RNA molecules
has been implemented. All modified positions from sequences
collected in MODOMICS can be mapped onto
reference diagrams based on the sequence alignments. For
rRNAs, we used the structure of Escherichia coli SSU and
LSU rRNAs obtained from the CRW as a reference. For
tRNAs, we generated a consensus secondary structure
diagram using VARNA (38), based on the data obtained
from the Transfer RNA database. Graphics are presented
using the JavaScript InfoVis Toolkit library (http://thejit.org/;
Nicolas Garcia Belmonte). It is possible to map information
from a user-selected set of sequences onto the
diagram. In such a case, the percentage of modified 
ribonucleosides of any type in each alignment position is 
calculated and displayed. The resulting diagrams can be 
downloaded as image files. 
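
The per-position percentage described above can be sketched as follows. This is not the MODOMICS implementation; it assumes that aligned sequences use one character per residue, that A, C, G and U denote unmodified residues, that '-' denotes a gap, that any other character denotes a modified residue, and that the percentage is taken over non-gap residues in each column.

```python
UNMODIFIED = set("ACGU")  # assumption: the four standard residues
GAP = "-"                 # assumption: gap character in the alignment


def percent_modified(alignment):
    """Return, for each alignment column, the percentage of non-gap
    residues that are modified (any character outside A, C, G, U)."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        residues = [c for c in column if c != GAP]
        modified = sum(1 for c in residues if c not in UNMODIFIED)
        result.append(100.0 * modified / len(residues) if residues else 0.0)
    return result


# Hypothetical user-selected sequences; 'D' and 'P' stand in for
# single-letter codes of modified residues.
rows = ["GCDA-U", "GCUAPU", "GCDAPU", "G-UA-C"]
print(percent_modified(rows))
```

Whether gaps should count in the denominator is a design choice; excluding them, as here, reports how often a residue is modified when it is present at that position.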



PROTEINS 

The MODOMICS database currently contains information
about 274 proteins involved in RNA modification,
both functional enzymes and protein co-factors necessary 
for multi-protein enzymatic activities. More than one 
hundred functionally and structurally characterized 
proteins have been added since the previous release (31) 
and the collection of protein sequences has been updated 
accordingly. We expanded the collection to include not 
only the functionally characterized RNA-modifying 
enzymes from E. coli and Saccharomyces cerevisiae, but 
also from other organisms, in particular if their crystal 
structures were available. 'Predicted' enzymes, whose
activity has not been experimentally validated by genetic
or biochemical methods, are currently excluded from
MODOMICS. Enzymes that have been characterized
biochemically in vitro or in vivo, but for which the corresponding
genes have not been identified, are also excluded
(although the corresponding reactions are collated).

The MODOMICS catalogue of proteins can be browsed 
by the source organism and/or type of the enzyme activity 
(methyltransferase, pseudouridine synthase, etc.). The list of
matches can be further refined by the user, based on
features such as name, position of modification, GI,
COGs, PDB ID of structure, etc. At the level of individual
protein entries, the database provides information about
protein name(s), synonyms, amino acid sequence, corresponding
ORF, modified RNA(s) and the position of the
residues modified (if available). For proteins that are parts
of enzymatic complexes, the name of the complex is 
provided. Reactions known to be catalysed by each 
protein with enzymatic activity are listed. Accession 
numbers from the Swiss-Prot (39) and GenPept (40) databases
are provided and proteins with experimentally
determined structures are linked to appropriate entries in 
the Protein Data Bank (41). 

snoRNAs 

As a new feature, we have included a census of human and
yeast snoRNAs involved in RNA-guided RNA modification
by the C/D box and H/ACA box ribonucleoproteins,
and we linked these snoRNAs to the corresponding modification
sites in human and yeast RNAs. For this section
of MODOMICS, we used data from the Saccharomyces 
Genome Database (42), the Yeast snoRNA Database (43) 
and snoRNA-LBME-db (44). The list of snoRNAs can be 
browsed by organism and/or type of modification found 
in the target position. Links to the HGNC database (45) 
and the Yeast snoRNA Database are provided for human 
and yeast snoRNAs, respectively. 
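
Browsing snoRNAs by organism and by the type of modification at the target position can be sketched with a small record type. The snoRNA identifiers, target RNAs and positions below are hypothetical placeholders rather than database entries; the only biological assumption built in is that C/D box guides direct 2'-O-methylation and H/ACA box guides direct pseudouridylation.

```python
from dataclasses import dataclass
from typing import Optional

# The guide class determines the modification introduced at the target site.
MODIFICATION_BY_BOX = {"C/D": "2'-O-methylation", "H/ACA": "pseudouridylation"}


@dataclass(frozen=True)
class GuideSite:
    snorna: str      # hypothetical snoRNA identifier
    organism: str
    box: str         # "C/D" or "H/ACA"
    target_rna: str
    position: int    # hypothetical target position

    @property
    def modification(self) -> str:
        return MODIFICATION_BY_BOX[self.box]


# Placeholder records for illustration only.
SITES = [
    GuideSite("snoA", "Homo sapiens", "C/D", "28S rRNA", 100),
    GuideSite("snoB", "Homo sapiens", "H/ACA", "18S rRNA", 200),
    GuideSite("snoC", "Saccharomyces cerevisiae", "C/D", "25S rRNA", 300),
]


def browse(sites, organism: Optional[str] = None,
           modification: Optional[str] = None):
    """Filter guide sites by organism and/or modification type."""
    return [s for s in sites
            if (organism is None or s.organism == organism)
            and (modification is None or s.modification == modification)]


for site in browse(SITES, organism="Homo sapiens"):
    print(site.snorna, site.target_rna, site.position, site.modification)
```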

BUILDING BLOCKS 

In this MODOMICS database release, we have added a 
catalogue of 'building blocks' for the chemical synthesis of 
naturally occurring modified nucleosides. Each modification
is uniquely characterized by its IUPAC name and
CAS number, but more than one building block may be
available for a given modification. The compilation is
intended to facilitate solid phase synthesis of modified
RNA, and thus to foster biophysical and biochemical
studies. It provides a rapid overview of which modifications
may be chemically incorporated with little or
moderate synthetic effort, and conversely, which modifications
remain attractive targets for synthetic bioorganic
chemistry endeavours. The listing was compiled from
the CAS database. As not all CAS entries are contained
in PubMed, the relevant literature pertaining to synthesis
and incorporation of building blocks is given. The contents of
the catalogue reflect the overall dominance of 'classical' phosphoramidite
chemistry, featuring an acid labile 5'-O-protecting
group and a fluoride labile 2'-O-protecting group.
Protective groups that have proven useful in published 
syntheses are given, including conditions for chemical 
deprotection after RNA synthesis. The protective groups
themselves have separate entries, which provide further information
on the reagent used for their introduction and
cross references to other building blocks containing the
same protecting group.

SEARCH 

New options for database searching and querying have 
been implemented, including a BLAST (27) search of 
protein sequences and a PARALIGN (28) search of 
nucleic acid sequences collected in MODOMICS, as 
well as a utility that sends a protein sequence from a 
MODOMICS entry to BLAST on the NCBI webserver. 
Hits and query-hit alignments resulting from 
MODOMICS searches can be downloaded in fasta format. 
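
Downloaded hits in fasta format can be read with a few lines of code. The sketch below is a generic FASTA reader, not code distributed with MODOMICS; the headers and sequences in the example are hypothetical, and the exact header layout of real downloads is not assumed.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary.

    Header lines start with '>'; sequence lines that follow are joined
    until the next header. Blank lines are ignored.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical download content for illustration.
example = """>hit_1 hypothetical protein
MKLV
TTAG
>hit_2 another hypothetical protein
MSSP
"""
for name, sequence in read_fasta(example).items():
    print(name, len(sequence))
```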

FUTURE PROSPECTS 

The total number of confirmed modifications and
RNA-modifying enzymes is growing continuously. Though there
is a considerable amount of experimentally derived information
available, there are still many modified positions in
well-characterized RNA molecules for which the responsible
enzymes are not known. New modified nucleosides
are also being discovered, especially in RNA originating
from more recently adopted model systems, such as
extremophilic prokaryotes. Projects aimed at systematically
studying genomes of related organisms hold promise
of many more surprises [e.g. the recent initiative to
sequence 5000 insect genomes (46)]. Thus, characterization
of RNA modification pathways appears to be a
moving target, and we appreciate feedback from users 
who bring newly discovered modifications and enzymes 
to our attention, and make suggestions of new methods 
of data presentation. In the future, we plan to further 
develop the graphical presentation of RNA modification 
data to allow comparative analysis of modification 
pathways and modification positions in sequences with 
respect to chosen taxa. An ultimate goal is to integrate 
MODOMICS with databases on other aspects of RNA 
metabolism. We also intend to link the information available
in MODOMICS with 'RNAcentral', the planned
database of RNA sequences (47). 

AVAILABILITY 

The data are freely accessible for research purposes at 
http://modomics.genesilico.pl. Most of the data are available
for download in plain text formats. Modified nucleosides
and building blocks are also available as structure
files and images. Images of pathways are available for
download from the web page in several formats. The
pathway graphs can also be downloaded as an xml file
(graphml format). Program code for parsing the plain 
text formats is available on request. 
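
A pathway graph saved in graphml format can be read with a standard XML parser. The sketch below is not the authors' parsing code; it uses the standard GraphML namespace, while the sample document and the 'label' data key are assumptions made for illustration, since the attribute names used in actual MODOMICS exports are not specified here.

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace, mapped to the prefix 'g' for searches.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical minimal GraphML document; the 'label' key is an assumption.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="label" for="node" attr.name="label" attr.type="string"/>
  <graph id="toy" edgedefault="directed">
    <node id="n0"><data key="label">U</data></node>
    <node id="n1"><data key="label">Y</data></node>
    <edge source="n0" target="n1"/>
  </graph>
</graphml>"""


def read_edges(graphml_text):
    """Return (source_label, target_label) pairs for every edge.

    Nodes without a 'label' data element fall back to their id.
    """
    root = ET.fromstring(graphml_text)
    labels = {}
    for node in root.iterfind(".//g:node", NS):
        data = node.find("g:data[@key='label']", NS)
        labels[node.get("id")] = data.text if data is not None else node.get("id")
    return [(labels[edge.get("source")], labels[edge.get("target")])
            for edge in root.iterfind(".//g:edge", NS)]


print(read_edges(SAMPLE))
```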